Google Books Ngrams Recompressed and Searchable
Internal identifier: 000216 (Main/Exploration); previous: 000215; next: 000217
Authors: Szymon Grabowski [Poland]; Jakub Swacha [Poland]
Source:
- Foundations of Computing and Decision Sciences [ 0867-6356 ] ; 2012-12-01.
Abstract
One of the research fields significantly affected by the emergence of “big data” is computational linguistics. A prominent example of a large dataset targeting this domain is the collection of Google Books Ngrams, made freely available, for several languages, in July 2009. There are two problems with Google Books Ngrams: the textual format (compressed with Deflate) in which they are distributed is highly inefficient, and we are not aware of any tool facilitating search over those data, apart from the Google viewer, which, as a Web tool, has seriously limited use. In this paper we present a simple preprocessing scheme for Google Books Ngrams that also enables search for an arbitrary n-gram (i.e., its associated statistics) in average time below 0.2 ms. The obtained compression ratio, with Deflate (zip) left as the backend coder, is over 3 times higher than in the original distribution.
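To make the idea of searching for an n-gram's statistics concrete, here is a minimal sketch. It is not the authors' preprocessing scheme, which the abstract does not detail. It assumes tab-separated rows of the form `ngram, year, match_count, page_count, volume_count`. The function names `build_index` and `lookup` are hypothetical, and the sample counts are made up for illustration.

```python
import bisect
from collections import defaultdict

# Illustrative rows only: the counts below are invented, not real Ngrams data.
SAMPLE_ROWS = [
    "big data\t2008\t9\t8\t4\n",
    "big data\t2007\t5\t5\t3\n",
    "small data\t2008\t2\t2\t1\n",
]

def build_index(lines):
    """Group per-year rows by n-gram; return (sorted n-grams, per-n-gram stats)."""
    stats = defaultdict(list)
    for line in lines:
        ngram, year, matches, pages, volumes = line.rstrip("\n").split("\t")
        stats[ngram].append((int(year), int(matches), int(pages), int(volumes)))
    keys = sorted(stats)
    # table[i] holds the year-ordered statistics of keys[i]
    return keys, [sorted(stats[k]) for k in keys]

def lookup(keys, table, ngram):
    """Binary-search the sorted n-gram list; return its statistics or None."""
    i = bisect.bisect_left(keys, ngram)
    if i < len(keys) and keys[i] == ngram:
        return table[i]
    return None

keys, table = build_index(SAMPLE_ROWS)
print(lookup(keys, table, "big data"))
```

Storing each n-gram once and keeping the key list sorted removes the repetition of the n-gram text on every per-year row and allows logarithmic-time lookup. Whether the paper uses the same layout is not stated in this record.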
Url: https://api.istex.fr/document/1A96B9FC68E740E8E445E06B684EE12892B17EDC/fulltext/pdf
DOI: 10.2478/v10209-011-0015-8
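Affiliations:
- Szymon Grabowski: Lodz University of Technology, Institute of Applied Computer Science, al. Politechniki 11, 90-924 Łódź, Poland
- Jakub Swacha: University of Szczecin, Institute of Information Technology in Management, Mickiewicza 64, 71-101 Szczecin, Poland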
Links to previous steps (curation, corpus...)
- to stream Istex, to step Corpus: 002F20
- to stream Istex, to step Curation: 002C89
- to stream Istex, to step Checkpoint: 000006
- to stream Main, to step Merge: 000220
- to stream Main, to step Curation: 000216
The document in XML format
<record><TEI wicri:istexFullTextTei="biblStruct"><teiHeader><fileDesc><titleStmt><title xml:lang="en">Google Books Ngrams Recompressed and Searchable</title>
<author><name sortKey="Grabowski, Szymon" sort="Grabowski, Szymon" uniqKey="Grabowski S" first="Szymon" last="Grabowski">Szymon Grabowski</name>
</author>
<author><name sortKey="Swacha, Jakub" sort="Swacha, Jakub" uniqKey="Swacha J" first="Jakub" last="Swacha">Jakub Swacha</name>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:1A96B9FC68E740E8E445E06B684EE12892B17EDC</idno>
<date when="2012-12-22" year="2012">2012-12-22</date>
<idno type="doi">10.2478/v10209-011-0015-8</idno>
<idno type="url">https://api.istex.fr/document/1A96B9FC68E740E8E445E06B684EE12892B17EDC/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">002F20</idno>
<idno type="wicri:Area/Istex/Curation">002C89</idno>
<idno type="wicri:Area/Istex/Checkpoint">000006</idno>
<idno type="wicri:doubleKey">0867-6356:2012:Grabowski S:google:books:ngrams</idno>
<idno type="wicri:Area/Main/Merge">000220</idno>
<idno type="wicri:Area/Main/Curation">000216</idno>
<idno type="wicri:Area/Main/Exploration">000216</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title level="a" type="main" xml:lang="en">Google Books Ngrams Recompressed and Searchable</title>
<author><name sortKey="Grabowski, Szymon" sort="Grabowski, Szymon" uniqKey="Grabowski S" first="Szymon" last="Grabowski">Szymon Grabowski</name>
<affiliation wicri:level="1"><country xml:lang="fr">Pologne</country>
<wicri:regionArea>Lodz University of Technology, Institute of Applied Computer Science, al. Politechniki 11, 90-924 Łódź</wicri:regionArea>
<wicri:noRegion>90-924 Łódź</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Swacha, Jakub" sort="Swacha, Jakub" uniqKey="Swacha J" first="Jakub" last="Swacha">Jakub Swacha</name>
<affiliation wicri:level="1"><country xml:lang="fr">Pologne</country>
<wicri:regionArea>University of Szczecin, Institute of Information Technology in Management, Mickiewicza 64, 71-101 Szczecin</wicri:regionArea>
<wicri:noRegion>71-101 Szczecin</wicri:noRegion>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series><title level="j">Foundations of Computing and Decision Sciences</title>
<idno type="ISSN">0867-6356</idno>
<idno type="eISSN">2300-3405</idno>
<imprint><publisher>Versita</publisher>
<date type="published" when="2012-12-01">2012-12-01</date>
<biblScope unit="volume">37</biblScope>
<biblScope unit="issue">4</biblScope>
<biblScope unit="page" from="271">271</biblScope>
<biblScope unit="page" to="281">281</biblScope>
</imprint>
<idno type="ISSN">0867-6356</idno>
</series>
<idno type="istex">1A96B9FC68E740E8E445E06B684EE12892B17EDC</idno>
<idno type="DOI">10.2478/v10209-011-0015-8</idno>
<idno type="ArticleID">v10209-011-0015-8</idno>
<idno type="Related-article-Href">v10209-011-0015-8.pdf</idno>
</biblStruct>
</sourceDesc>
<seriesStmt><idno type="ISSN">0867-6356</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass></textClass>
<langUsage><language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">One of the research fields significantly affected by the emergence of “big data” is computational linguistics. A prominent example of a large dataset targeting this domain is the collection of Google Books Ngrams, made freely available, for several languages, in July 2009. There are two problems with Google Books Ngrams; the textual format (compressed with Deflate) in which they are distributed is highly inefficient; we are not aware of any tool facilitating search over those data, apart from the Google viewer, which, as a Web tool, has seriously limited use. In this paper we present a simple preprocessing scheme for Google Books Ngrams, enabling also search for an arbitrary n-gram (i.e., its associated statistics) in average time below 0.2 ms. The obtained compression ratio, with Deflate (zip) left as the backend coder, is over 3 times higher than in the original distribution.</div>
</front>
</TEI>
<affiliations><list><country><li>Pologne</li>
</country>
</list>
<tree><country name="Pologne"><noRegion><name sortKey="Grabowski, Szymon" sort="Grabowski, Szymon" uniqKey="Grabowski S" first="Szymon" last="Grabowski">Szymon Grabowski</name>
</noRegion>
<name sortKey="Swacha, Jakub" sort="Swacha, Jakub" uniqKey="Swacha J" first="Jakub" last="Swacha">Jakub Swacha</name>
</country>
</tree>
</affiliations>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000216 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000216 | SxmlIndent | more
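Where Dilib is not available, the XML record can also be read with a general-purpose parser. The sketch below uses Python's standard `xml.etree.ElementTree` on one `<author>` fragment copied from the record. The record uses the `wicri:` prefix without declaring it in the excerpt shown here, so a strict parser would reject it. As an assumption, the fragment is wrapped in an element that binds the prefix to a placeholder URI (`urn:example:wicri`, not an official namespace). The helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# One <author> element copied from the record above.
FRAGMENT = """<author><name sortKey="Swacha, Jakub" sort="Swacha, Jakub" uniqKey="Swacha J" first="Jakub" last="Swacha">Jakub Swacha</name>
<affiliation wicri:level="1"><country xml:lang="fr">Pologne</country>
<wicri:regionArea>University of Szczecin, Institute of Information Technology in Management, Mickiewicza 64, 71-101 Szczecin</wicri:regionArea>
<wicri:noRegion>71-101 Szczecin</wicri:noRegion>
</affiliation>
</author>"""

# Placeholder namespace URI (assumption): any URI works, it only binds the prefix.
WICRI_NS = "urn:example:wicri"

def parse_fragment(fragment):
    """Wrap the fragment so the undeclared wicri: prefix becomes parseable."""
    wrapped = f'<wrapper xmlns:wicri="{WICRI_NS}">{fragment}</wrapper>'
    return ET.fromstring(wrapped)

def author_summary(root):
    """Collect name, country and address for every <author> element."""
    summary = []
    for author in root.iter("author"):
        name = author.find("name")
        country = author.find("affiliation/country")
        area = author.find(f"affiliation/{{{WICRI_NS}}}regionArea")
        summary.append({
            "first": name.get("first"),
            "last": name.get("last"),
            "country": country.text,
            "address": area.text,
        })
    return summary

print(author_summary(parse_fragment(FRAGMENT)))
```

The same wrapping approach should apply to the full `<record>` element, provided the record text is saved to a file and read as a string first.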
To link to this page within the Wicri network
{{Explor lien |wiki= Ticri/CIDE |area= OcrV1 |flux= Main |étape= Exploration |type= RBID |clé= ISTEX:1A96B9FC68E740E8E445E06B684EE12892B17EDC |texte= Google Books Ngrams Recompressed and Searchable }}
This area was generated with Dilib version V0.6.32.